Introduction

The dataset has 10 variables: carat, cut, color, clarity, depth, table, price, x (length), y (width), and z (depth). It contains 53,940 records, one per diamond. This analysis focuses on carat, cut, color, and price, and looks for relationships between price and the other variables.

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
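The dimensions quoted in the introduction can be checked directly. This is a minimal sketch, assuming the data is the `diamonds` tibble shipped with ggplot2 (the text does not name the package, though the column layout matches):

```r
library(ggplot2)  # assumed source of the diamonds data

dim(diamonds)     # number of rows and columns
names(diamonds)   # the ten variable names
head(diamonds)    # first six records
```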

Variables

Carat

Carat ranges from 0.2 to 5.01. The median carat is 0.7 and the mean is about 0.8. The histogram shows that diamonds are concentrated around 0.2 - 0.3 and 0.9 - 1.0 carats. About 75% of diamonds are no larger than 1.04 carats (the third quartile).

summary(diamonds$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100
range(diamonds$carat)
## [1] 0.20 5.01
ggplot(data = diamonds) + 
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1,fill ="#39568CFF")
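The 75% figure relates to the third quartile in the summary. A short sketch, assuming `diamonds` from ggplot2, that reads the quartile off directly and also computes the exact share of diamonds at or below one carat (the variable names are illustrative):

```r
library(ggplot2)

# Third quartile: 75% of diamonds weigh this much or less
q3 <- quantile(diamonds$carat, probs = 0.75)

# Exact share of diamonds at or below 1 carat
share_le_1 <- mean(diamonds$carat <= 1)

q3
share_le_1
```

Comparing the two values shows how much of the 75% sits in the narrow band just above one carat.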

Only a few diamonds exceed 3 carats, so we ignore them and zoom in on the range from 0 to 3 carats. The peaks usually fall near carat values ending in .0, .25, .5, and .75.

diamonds %>% filter(carat < 3)%>%
  ggplot()+
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01,fill ="#39568CFF")
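To see where the spikes sit without reading them off the plot, the most frequent carat values can be tabulated. This is a sketch using dplyr's `count()`; the cutoff of ten rows and the name `top_carats` are arbitrary choices for illustration:

```r
library(ggplot2)
library(dplyr)

top_carats <- diamonds %>%
  filter(carat < 3) %>%          # same zoom as the histogram
  count(carat, sort = TRUE) %>%  # frequency of each recorded carat value
  head(10)                       # keep the ten most common values

top_carats
```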

Cut

Most diamonds have an Ideal cut, followed by Premium and Very Good. The pie chart shows the percentages more clearly: 40% are Ideal, 25.6% are Premium, and just under 3% are Fair.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

p <- c( '#440154FF','#39568CFF','#20A387FF','#95D840FF','#FDE725FF')
plot_ly() %>%
  add_pie(data = count(diamonds, cut), labels = ~cut, values = ~n,
  name = "Cut", domain = list(row = 0, column =0), marker = list(colors = p))
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(x= cut, y = carat))
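The percentages read from the pie chart can also be computed as a table. A minimal sketch in base R, assuming `diamonds` from ggplot2 (`cut_share` is an illustrative name):

```r
library(ggplot2)

# Share of each cut grade, in percent
cut_share <- prop.table(table(diamonds$cut)) * 100

round(cut_share, 1)
```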

Color

Color is graded from J to D, where J is the worst and D is the best. G is the most common color.

ggplot(data = diamonds, mapping = aes(x= color))+
  geom_bar(aes(fill=color))

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
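The color counts behind the bar chart can be listed directly. A small sketch, assuming `diamonds` from ggplot2, where `color` is stored as an ordered factor running from D to J:

```r
library(ggplot2)

levels(diamonds$color)                 # factor levels, D (best) through J (worst)

color_counts <- table(diamonds$color)  # number of diamonds per color grade
sort(color_counts, decreasing = TRUE)

most_common <- names(which.max(color_counts))
most_common
```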

Clarity

Clarity scale from worst to best is: I1(worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF(best).

ggplot(data = diamonds, mapping = aes(x= clarity))+
  geom_bar(aes(fill=clarity))

Price

Prices range from $326 to $18,823. Most diamonds in this dataset cost under $5,000.

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 1000,fill ="#39568CFF")

ggplot(
  data = diamonds,
  mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
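The price range and the share of diamonds under $5,000 can be confirmed numerically. A minimal sketch, assuming `diamonds` from ggplot2; the $5,000 threshold is taken from the text and the variable names are illustrative:

```r
library(ggplot2)

price_range <- range(diamonds$price)             # lowest and highest price
share_under_5000 <- mean(diamonds$price < 5000)  # fraction priced below $5,000

price_range
share_under_5000
```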

Relationships

Carat & Price

In general, price increases with carat.

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

ggplot(data = diamonds) + 
  geom_point(
    mapping = aes(x = carat, y = price),
    alpha=1/100 
    )

ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
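The upward trend can also be summarized with numbers. The following sketch is not part of the original analysis: it computes the Pearson correlation and fits a simple linear model on the log-log scale, a choice made here only because the later plots use `log(carat)` and `log(price)`:

```r
library(ggplot2)

# Linear association between carat and price
r <- cor(diamonds$carat, diamonds$price)

# Log-log fit: the slope approximates the percent change in price
# for a one percent change in carat
fit <- lm(log(price) ~ log(carat), data = diamonds)
slope <- unname(coef(fit)["log(carat)"])

r
slope
```

A slope above 1 would indicate that price grows faster than proportionally with carat.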

Cut & Price

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + geom_boxplot()

diamonds %>%
  ggplot(aes(log(carat),log(price), col= cut))+
  geom_point()

Color & Price

diamonds %>%
  ggplot(aes(carat,price, col= color))+
  geom_point()

diamonds %>%
  ggplot(aes(log(carat),log(price), col= color))+
  geom_point()

Clarity & Price

From the scatter plot, we can see that at certain carat, I1 is always

diamonds %>%
  ggplot(aes(carat, price, col= clarity))+
  geom_point()

diamonds %>%
  ggplot(aes(log(carat), log(price), col= clarity))+
  geom_point()
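One way to compare clarity grades at a fixed size is to restrict the data to a narrow carat band and summarize price per grade. This is a sketch under assumptions: the band of 1.0 to 1.1 carats and the name `clarity_at_1ct` are hypothetical choices for illustration, not taken from the analysis:

```r
library(ggplot2)
library(dplyr)

clarity_at_1ct <- diamonds %>%
  filter(carat >= 1, carat < 1.1) %>%  # hypothetical band around one carat
  group_by(clarity) %>%
  summarise(
    n = n(),                           # diamonds in this grade and band
    median_price = median(price)       # typical price for the grade
  )

clarity_at_1ct
```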